Add agent skill for creating new model #2206
Pull request overview
Adds a new agent skill document intended to guide contributors/agents through adding support for new HuggingFace model architectures in the Python model builder and debugging ONNX-vs-PyTorch numerical parity issues.
Changes:
- Introduces `.github/skills/new-model/SKILL.md` with file orientation, builder dispatch guidance, and an export/debugging workflow.
- Documents MoE/QMoE-related implementation notes and common parity failure modes.
- `make_moe_op` emits `MoE` (fp16) or `QMoE` (int4/int8). `make_qmoe_weights`
  quantizes and packs each expert weight `[N, K]`.
- **CUDA QMoE weight encoding (critical):** the kernel is a CUTLASS fpA_intB
  mixed GEMM that consumes **offline-prepacked** weights. The proven recipe
  (see `_cutlass_prepacked_blockwise_quantize` in `base.py`):
  1. transpose weight to `[K, N]`;
  2. `onnxruntime...quantize_matmul_4bits(qw, w_T, scales, zp, block, N, K, is_symmetric=True)`;
  3. **keep the SIGNED scales**: do NOT `abs()` them. The kernel dequantizes as
     `(q - 8) * scale`, and `quantize_matmul_4bits` folds the block-anchor sign
pack_weights_for_cuda_mixed_gemm is available in the onnxruntime-gpu package >= 1.27. It is the preferred way to pack weights (no dependency on tensorrt-llm). See https://github.com/microsoft/onnxruntime/blob/main/docs/contrib_ops/cuda/moe_qmoe.md
Fixed in the latest commit. Removed the references to _cutlass_prepacked_blockwise_quantize, quantize_matmul_4bits, force_arch=80, and the signed-scale claim. Kept pack_weights_for_cuda_mixed_gemm (from onnxruntime-gpu >= 1.27) as the preferred, TRT-LLM-free packing approach, and updated the recipe to reflect the actual _symmetric_blockwise_quantize + pack_weights_for_cuda_mixed_gemm flow.
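For readers unfamiliar with the quantization step mentioned in this reply, the following is a minimal NumPy sketch of generic symmetric blockwise 4-bit quantization with automatic padding. It is an illustration under stated assumptions, not the repository's `_symmetric_blockwise_quantize`: the function names, the `max_abs / 7` scale choice, and the `[N, blocks, block_size]` layout are all hypothetical. It does not perform the CUDA prepacking, which per the comments above is handled by `pack_weights_for_cuda_mixed_gemm`.

```python
import numpy as np


def symmetric_blockwise_quantize_sketch(weight, block_size=128):
    """Hypothetical helper: quantize [N, K] weights to 4-bit codes per block along K."""
    n, k = weight.shape
    pad = (-k) % block_size  # zero-pad K up to a multiple of block_size
    padded = np.pad(weight.astype(np.float32), ((0, 0), (0, pad)))
    blocks = padded.reshape(n, -1, block_size)

    # Assumed scale choice: one positive scale per block, mapping max |w| to +/-7.
    max_abs = np.abs(blocks).max(axis=-1, keepdims=True)
    scales = np.where(max_abs > 0, max_abs / 7.0, 1.0).astype(np.float32)

    # Store unsigned codes in [0, 15] with an implicit zero point of 8.
    codes = np.clip(np.rint(blocks / scales) + 8, 0, 15).astype(np.uint8)
    return codes, scales, k


def dequantize_sketch(codes, scales, k):
    """Invert the sketch above: (q - 8) * scale, then drop the padding."""
    n = codes.shape[0]
    values = (codes.astype(np.float32) - 8.0) * scales
    return values.reshape(n, -1)[:, :k]
```

A round trip through these two functions reproduces each weight to within half of its block scale, which is a convenient sanity check before comparing ONNX and PyTorch outputs.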
- The QMoE node then uses the **default** `weights_prepacked` (omit the attribute;
  default = prepacked). Do **not** set `weights_prepacked=0` (the raw-weight +
  runtime-PrePack-hook path is finiteness-checked only and is not bit-correct).
- **CUDA QMoE only supports `block_size` 64 or 128.** Assert this in the builder.
This is a limitation (block size 64/128) of the CUDA QMoE op. See https://github.com/microsoft/onnxruntime/blob/main/docs/contrib_ops/cuda/moe_qmoe.md for details.
Fixed — updated the bullet to state that qmoe_block_size supports values 16, 32, 64, 128, or 256 (default 128 for CUDA/TRT-RTX, 32 otherwise), and that _symmetric_blockwise_quantize pads automatically.
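The block-size rule described in this reply could be checked with a small helper like the one below. This is only a sketch: the function names and the execution-provider strings (`"cuda"`, `"trt-rtx"`) are hypothetical, and the allowed values and defaults are taken from the reply as written. The linked moe_qmoe.md remains the reference for what the CUDA kernel itself accepts.

```python
# Values as stated in the reply above; treat them as an assumption, not a spec.
SUPPORTED_QMOE_BLOCK_SIZES = (16, 32, 64, 128, 256)


def resolve_qmoe_block_size(execution_provider, requested=None):
    """Hypothetical helper: pick a default block size and validate a requested one."""
    default = 128 if execution_provider in ("cuda", "trt-rtx") else 32
    block_size = default if requested is None else requested
    if block_size not in SUPPORTED_QMOE_BLOCK_SIZES:
        raise ValueError(
            f"qmoe_block_size={block_size} is not one of {SUPPORTED_QMOE_BLOCK_SIZES}"
        )
    return block_size


def padded_length(k, block_size):
    """Round K up to the next multiple of block_size (the padding the reply mentions)."""
    return -(-k // block_size) * block_size
```

For example, `padded_length(200, 128)` gives 256, so a weight with `K = 200` would be zero-padded by 56 columns before blockwise quantization.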
- The QMoE node then uses the **default** `weights_prepacked` (omit the attribute;
  default = prepacked). Do **not** set `weights_prepacked=0` (the raw-weight +
  runtime-PrePack-hook path is finiteness-checked only and is not bit-correct).
Removed the weights_prepacked bullet entirely — make_qmoe_op does not set that attribute and it has no presence in the codebase.
The weights_prepacked attribute is a new attribute added in the latest onnxruntime: microsoft/onnxruntime#28749.
Add a skill for agents to create a new model.